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Abstract 

Bayesian models offer great flexibility for clustering applications — Bayesian nonparametrics can be 
used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters 
across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such 
as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric 
viewpoint. Inspired by the asymptotic connection between k-means and mixtures of Gaussians, we show 
that a Gibbs sampling algorithm for the Dirichlet process mixture approaches a hard clustering algorithm in 
the limit, and further that the resulting algorithm monotonically minimizes an elegant underlying k-means- 
like clustering objective that includes a penalty for the number of clusters. We generalize this analysis 
to the case of clustering multiple data sets through a similar asymptotic argument with the hierarchical 
Dirichlet process. We also discuss further extensions that highlight the benefits of our analysis: i) a spectral 
relaxation involving thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm that does 
not fix the number of clusters in the graph. 

1 Introduction 

There is now little debate that Bayesian statistics have had tremendous impact on the field of machine learn- 
ing. For the problem of clustering, the topic of this paper, the Bayesian approach allows for flexible models 
in a variety of settings. For instance. Latent Dirichlet Allocation |2|, a hierarchical mixture of multinomials, 
reshaped the topic modeling community and has become a standard tool in document analysis. Bayesian 
nonparametric models, such as the Dirichlet process mixture (8l, result in infinite mixture models which do 
not fix the number of clusters in the data upfront; these methods continue to gain popularity in the learning 
community. 

Yet despite the success and flexibility of the Bayesian framework, simpler methods such as k-means 
remain the preferred choice in many large-scale applications. For instance, in visual bag-of-words models |7 |, 
large collections of image patches are quantized, and k-means is universally employed for this task. A major 
motivation for using k-means is its simplicity and scalability: whereas Bayesian models require sampling 
algorithms or variational inference techniques which can be difficult to implement and are often not scalable, 
k-means is straightforward to implement and works well for a variety of applications. 

In this paper, we attempt to achieve the best of both worlds by designing scalable hard clustering al- 
gorithms from a Bayesian nonparametric viewpoint. Our approach is inspired by the connection between 
k-means and mixtures of Gaussians, namely that the k-means algorithm may be viewed as a limit of the 
expectation-maximization (EM) algorithm — if all of the covariance matrices corresponding to the clusters 
in a Gaussian mixture model are equal to a I and we let a go to zero, the EM steps approach the k-means 
steps in the limit. As we will show, in the case of a Dirichlet process (DP) mixture model — the standard 
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Bayesian nonparametric mixture model — we can perform a similar limiting argument in the context of a 
simple Gibbs sampler This leads to an algorithm with hard cluster assignments which is very similar to the 
classical k-means algorithm except that a new cluster is formed whenever a point is sufficiently far away 
from all existing cluster centroids. Further, we show that this algorithm monotonically converges to a local 
optimum of a k-means-like objective which includes a penalty for the number of clusters. 

We then take a step further into the realm of hierarchical Bayesian models, and extend our analysis to the 
hierarchical Dirichlet process (HDP) |fT6l . The HDP is a model for shared clusters across multiple data sets; 
when we take an analogous asymptotic limit for the HDP mixture, we obtain a novel k-means-like algorithm 
that clusters multiple data sets with shared cluster structure. The resulting algorithm clusters each data set into 
local clusters, but local clusters are shared across data sets to form global clusters. The underlying objective 
function in this case turns out to be the k-means objective with additional penalties for the number of local 
clusters and the number of global clusters. 

To further demonstrate the practical value of our approach, we present two additional extensions. First, we 
show that there is a spectral relaxation for the k-means-like objective arising from the DP mixture. Unlike the 
standard relaxation for k-means, which computes the top-k eigenvectors, our relaxation involves computing 
eigenvectors corresponding to eigenvalues above a threshold, and highlights an interesting connection be- 
tween spectral methods and Bayesian nonparametrics. Second, given existing connections between k-means 
and graph clustering, we propose a penalized normalized cut objective for graph clustering, and utilize our 
earlier results to design an algorithm for monotonic optimization. Unlike the standard normalized cut for- 
mulation fTT.TSl, our formulation does not fix the number of clusters in the graph. We conclude with some 
results highlighting that our approach retains the flexibility of the Bayesian models while featuring the scal- 
ability of the classical techniques. Ultimately, we hope that this line of work will inspire additional research 
on the integration of Bayesian nonparametrics and hard clustering methods. 



2 Background 

We begin with a short discussion of the relevant models and algorithms considered in this work: mixtures of 
Gaussians, k-means, and DP mixtures. 

2.1 Gaussian Mixture Models and k-means 

In a (finite) Gaussian mixture model, we assume that data arises from the following distribution: 

k 

where k is the fixed number of components, tTc are the mixing coefficients, and /ic and Ec are the means and 
covariances, respectively, of the k Gaussian distributions. In the non-Bayesian setting, we can use the EM 
algorithm to perform maximum likelihood given a set of observations a^i, a;„. Briefly, we initialize the 
means /Xc, covariances Ec, and mixing coefficients tTc. Then we alternate between the E-step and M-step. In 
the E-step, using the current parameter values, we compute the following quantities for alH = 1, n and 
for all c = 1, /c: 

, . _ T^MiXi I Mc,Se) 
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In the M-step, we re-estimate the parameters using the values of j{zic): 

n 



where ric = X]"=i 7(^ic)- One can show that the EM algorithm converges to a local optimum of the log 
likelihood function. Note that the values 7(zic) are the probabilities of assigning point Xi to cluster c, and so 
the resulting clustering is a soft clustering of the data. 

A related model for clustering is provided by the k-means objective function, an objective for discovering 
a hard clustering of the data. Given a set of data points a;i , Xn, the k-means objective function attempts to 
find clusters ^i, ^/t to minimize the following objective function: 

min Ec=iExe^JI^-Mc|ll 

where = |^ Exe£<= ^■ 

The most popular method for minimizing this objective function is simply called the k-means algorithm. One 
initializes the algorithm with a hard clustering of the data along with the cluster means of these clusters. 
Then the algorithm alternates between reassigning points to clusters and recomputing the means. For the 
reassignment step one computes the squared Euclidean distance from each point to each cluster mean, and 
finds the minimum, by computing = argmin^||a;i — Mc|l2- Each point is then reassigned to the cluster 
indexed by The centroid update step of the algorithm then recomputes the mean of each cluster, 

updating /Xc for all c. 

The EM algorithm for mixtures of Gaussians is quite similar to the k-means algorithm. Indeed, one can 
show a precise connection between the two algorithms. Suppose in the mixture of Gaussians model that all 
Gaussians have the same fixed covariance equal to al. Because they are fixed, the covariances need not be 
re-estimated during the M-step. In this case, the E-step takes the following form: 

, X _ 7rc-exp(- ^||a:i - Mclli) 

Xij^iTTj •exp(- 2^||a;i-/Xj||^) 

It is straightforward to show that, in the limit as cr 0, the value of ^{zi^ approaches zero for all c except 
for the one corresponding to the smallest distance ||a;i — /Xc||2- In this case, the E-step is equivalent to the 
reassignment step of k-means, and one can further easily show that the M-step exactly recomputes the means 
of the new clusters, establishing the equivalence of the updates. We also note that farther interesting con- 
nections between k-means and probabilistic clustering models were explored in |10|. Though they approach 
the problem differently (i.e., not from an asymptotic view), the authors also ultimately obtain k-means-like 
algorithms that can be applied in the nonparametric setting. 



2.2 Dirichlet Process Mixture Models 

We briefly review DP mixture models |8|. We can equivalently write the standard Gaussian mixture as a 
generative model where one chooses a cluster with probability tTc and then generates an observation from the 
Gaussian corresponding to the chosen cluster The distribution over the cluster indicators follows a discrete 
distribution, so a Bayesian extension to the mixture model arises by first placing a Dirichlet prior of dimension 
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k, Dir(A:, ttq), on the mixing coefficients, for some ttq. If we further assume that the covariances of the 
Gaussians are fixed to al and that the means are drawn from some prior distribution Gq, we obtain the 
following Bayesian model: 



Zi , . . . , Zfi 



TV 



n 



Go 

Dir(fc,7ro) 
Discrete(7r) 



letting TT = (tti, TTfe). One way to view the DP mixture model is to take a limit of the above model as 
fc — > oo when choosing ttq = {a/k)e, where e is the vector of all ones. One of the simplest algorithms 
for inference in a DP mixture is based on Gibbs sampling; this approach was utilized by |17| and further 
discussed by ifTSl . Algorithm 2. The state of the underlying Markov chain consists of the set of all cluster 
indicators and the set of all cluster means. The algorithm proceeds by first looping repeatedly through each 
of the data points and performing Gibbs moves on the cluster indicators for each point. For i = 1, n, we 
reassign Xi to existing cluster c with probability n_i c • Af{xi \ fj,c, <jI)/Z, where n_i.c is the number of data 
points (excluding Xi) that are assigned to cluster c. With probability 



we start a new cluster Z is an appropriate normalizing constant. If we end up choosing to start a new cluster, 
we select its mean from the posterior distribution obtained from the prior Gq and the single sample Xi. After 
resampling all clusters, we perform Gibbs moves on the means: we sample fj,c given all points currently 
assigned to cluster c. Vc. 

We note that one often writes the DP mixture model (adapted to our Gaussian mixture scenario) as fol- 
lows: 



Thus, each G is a draw from the Dirichlet process DP(Go, a), whose base measure Go is a prior over means 
of the Gaussians. We can think of a draw from G as choosing one of the infinite means fic drawn from Gq, 
with the property that the means are chosen with probability equal to the corresponding mixing weights. As 
a result, each is equal to fi^ for some c. 

3 Hard Clustering via Dirichlet Processes 

In the following sections, we derive hard clustering algorithms based on DP mixture models. We will analyze 
properties of the resulting algorithms and show connections to existing hard clustering algorithms, particu- 
larly k-means. 

3.1 Asymptotics of the DP Gibbs Sampler 

Using the DP mixture model introduced in the previous section, let us first define Gq. Since we are fixing the 
covariances, Gq is the prior distribution over the means, which we will take to be a zero-mean Gaussian with 
pi covariance, i.e., fj, ^ Af{0, pi). Given this prior, the Gibbs probabilities can be computed in closed form. 
A straightforward calculation reveals that the probability of starting a new cluster is equal to: 




G 





4 



Similarly, the probability of being assigned to cluster c is equal to 



Z ' ' "V 2cr 

Z normalizes these probabilities to sum to 1 . We now would like to see what happens to these probabilities 
as cr — > 0. However, in order to obtain non-trivial assignments, we must additionally let a be a function of a 
and p. In particular, we will write a = (1 + pjaYI"^ ■ exp(— ^) for some A. Now, let 7(zic) correspond to 
the posterior probability of point i being assigned to cluster c and let ^{zi^new) be the posterior probability 
of starting a new cluster. After simplifying, we obtain the following probabilities to be used during Gibbs 
sampling: 7(2^^) = 



for existing clusters and ^(zi^new) — 



exp 



2cr 2(p4 



for generating a new cluster Now we consider the asymptotic behavior of the above probabilities. The 
numerator for ^{zi^new) can be written as 



exp 



1 

2^ 



A + 



It is straightforward to see that, as with a fixed p, the A term dominates this numerator. Furthermore, 
all of the above probabilities will become binary; in particular, the values of 7(zi,c) ™d 7(2^ „ew) will be 
increasingly dominated by the smallest value of {\\xi — \\xi — ^lk\\'^ , A}. In the limit, only the 

smallest of these values will receive a non-zero 7 value. The resulting update, therefore, takes a simple form 
that is analogous to the k-means cluster reassignment step. We reassign a point to the cluster corresponding 
to the closest mean, unless the closest cluster has squared EucUdean distance greater than A. In this case, we 
start a new cluster 

If we choose to start a new cluster, the final step is to sample a new mean from the posterior based on 
the prior Go and the single observation Xi. Similarly, once we have performed Gibbs moves on the cluster 
assignments, we must perform Gibbs moves on all the means, which amounts to sampling from the posterior 
based on Gq and all observations in a cluster Since the prior and likelihood are Gaussian, the posterior will 
be Gaussian as well. If we let Xc be the mean of the points currently assigned to cluster c and Uc be the 
number of points assigned to cluster c, then the posterior is a Gaussian with mean fic and covariance 
where 

/ (T \ _ ~ ap 

= IH Xc, Sc = — , 1- 

\ pnc J cr + pric 

As before, we consider the asymptotic behavior of the above Gaussian distribution as cr — > 0. The mean 
of the Gaussian approaches and the covariance goes to zero, meaning that the mass of the distribution 
becomes concentrated at x^- Thus, in the limit we choose x^. as the mean. 

Putting everything together, we obtain a hard clustering algorithm that behaves similarly to k-means with 
the exception that a new cluster is formed whenever a point is farther than A away from every existing cluster 
centroid. We choose to initialize the algorithm with a single cluster whose mean is simply the global centroid; 
the resulting algorithm is specified as Algorithm [T] which we denote as the DP-means algorithm. Note that, 
unlike standard k-means, which depends on the initial clustering of the data, the DP-means algorithm depends 
on the order in which data points are processed. One area of future work would consider adaptive methods 
for choosing an ordering. 
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Algorithm 1 DP-means 



Input: cci, a;„: input data, A : cluster penalty parameter 
Output: Clustering li, ...,lk and number of clusters k 

1. Init. k = l,£i — {xi , Xn} and /xi the global mean. 

2. Init. cluster indicators Zi — 1 for alH = 1, n. 

3. Repeat until convergence 

• For each point Xi 

- Compute die ~ \\xi — /J.c||^ for c = 1, k 

- If mine die > \, set k = k + 1, Zi — k, and fik ~ Xi. 

- Otherwise, set Zi — argmin^ d^c. 

• Generate clusters l\, based on , . . . , : Ij = {xi \ Zi = j}- 

• For each cluster ij, compute jjLj = X^xe* 



3.2 Underlying Objective and the AIC 

With the procedure from the previous section in hand, we can now analyze its properties. A natural question 
to ask is whether there exists an underlying objective function corresponding to this k-means-like algorithm. 
In this section, we show that the algorithm monotonically decreases the following objective at each iteration, 
where an iteration is defined as a complete loop through all data points to update all cluster assignments and 
means: 

l^c J c=l 

where = Exef. ^- (1) 

This objective is simply the k-means objective function with an additional penalty based on the number of 
clusters. The threshold A controls the tradeoff between the traditional k-means term and the cluster penalty 
term. We can prove the following: 

Theorem 3.1. Algorithm^monotonically decreases the objective given in Q until local convergence. 

Proof. The proof follows a similar argument as the proof for standard k-means. The reassignment step results 
in a non-increasing objective since the distance between a point and its newly assigned cluster mean never 
increases; for distances greater than A, we can generate a new cluster and pay a penalty of A while still 
decreasing the objective. Similarly, the mean update step results in a non-increasing objective since the mean 
is the best representative of a cluster in terms of the squared Euclidean distance. The fact that the algorithm 
will converge locally follows from the fact that the objective function cannot increase, and that there are only 
a finite number of possible clusterings of the data. □ 

Perhaps unsurprisingly, this objective has been studied in the past in conjunction with the Akaike Infor- 
mation Criterion (AIC). For instance, [H] describe the above penalized k-means objective function with a 
motivation arising from the AIC. Interestingly, it does not appear that algorithms have been derived from this 
particular objective function, so our analysis seemingly provides the first constructive algorithm for mono- 
tonic local convergence as well as highlighting the connections to the DP mixture model. Finally, in the case 
of k-means, one can show that the complete-data log likelihood approaches the k-means objective in the limit 
as (T ^ 0. We conjecture that a similar result holds for the DP mixture model, which would indicate that our 
result is not specific to the particular choice of the Gibbs sampler 



6 




Figure 1: Overview of clustering with multiple data sets. Each data set has some number of local clusters, and 
each local cluster is associated with some global mean fip. Each global mean fip is computed as the mean 
of all points (across all data sets) associated with that global cluster When reassigning points to clusters, the 
objective function penalizes the formation of either a new local cluster or a new global cluster See text for 
details. 

4 Clustering with Multiple Data Sets 

One of the most useful extensions to the standard DP mixture model arises when we introduce another DP 
layer on top of the base measure. Briefly, assume we have a set of data sets, each of which is modeled as 
a DP mixture. However, instead of defining the base measure of each DP mixture using Gq, the prior over 
the means, we instead let Gq itself be a Dirichlet process whose base measure is a prior over the means. The 
result is that, given a collection of data sets, we can cluster each data set while ensuring that the clusters across 
the data sets are shared appropriately. We will not describe the resulting hierarchical Dirichlet process (HDP) 
nor its corresponding sampling algorithms in detail, but we refer the reader to lfT6l for a detailed introduction 
to the HDP model and a description of inference techniques. We will see that the limiting process described 
earlier for the standard DP can be straightforwardly extended to the HDP; we will outline the algorithm below, 
and Figure [T] gives an overview of the approach. 

To set the stage, let us assume that we have D data sets, 1, j, D. Denote Xij to be data point i from 
data set j, and let there be Uj data points from each data set j. The basic idea is that we will locally cluster 
the data points from each data set, but that some cluster means will be shared across data sets. Each data set 
7 has a set of local cluster indicators given by Zij such that Zij = c if data point i in data set j is assigned to 
local cluster Sjc- Each local cluster Sjc is associated to a global cluster mean fip. 

Recall the standard form for the DP mixture model under our settings: 

G - DP(a,Go) 

^ G for i = 1 , . . . , n 

Xi ^ J\f{(t)i,<Tl) for i — 1, ...,n. 

For the HDP, we have a set of data sets indexed by j, each of which is a DP mixture. However, instead of 
defining the base measure of each DP mixture using Go, the prior over the means, we instead let Go itself be 
a Dirichlet process whose base measure is a prior over the means. This yields the following: 

DP(7,i/) 

DP(a,Go) forj = l,...,D 
Gj for all i, j 

N{(j)ij , al) for all i, j. 

Analogous to the standard DP mixture, the (t)ij chooses some global mean /i.c, now based on both the local 



Go ^ 
G, ^ 
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and global Dirichlet processes Gj and Go, respectively. The prior over the means of the Gaussian is now 
specified by H. 

4.1 The Hard Gaussian HDP 

We can now extend the asymptotic argument that we employed for the hard DP algorithm to the HDP. We 
will summarize the resulting algorithm; the derivation is analogous to the derivation for the single DP mixture 
case. As with the hard DP algorithm, we will have a threshold that determines when to introduce a new 
cluster. For the hard HDP, we will require two parameters: let be the "local" threshold parameter, and Xg 
be the "global" threshold parameter The algorithm works as follows: for each data point Xij, we compute 
the distance to every global cluster fip. For any global cluster p for which there is no current association in 
data set j, we add a penalty of to the distance (intuitively, this penalty captures the fact that if we end up 
assigning Xij to a global cluster that is not currently in use by data set j, we will incur a penalty of Xe to create 
a new local cluster, which we only want to do if the cluster if sufficiently close to Xij). We reassign each data 
point Xij to its nearest cluster, unless the closest distance is greater than Xi + Xg, in which case we start a new 
global cluster (in this case we are starting a new local cluster and a new global cluster, hence the sum of the 
two penalties). Then, for each local cluster, we consider whether to reassign it to a different global mean: for 
each local cluster Sjc, we compute the sum of distances of the points to every fip. We reassign the association 
of Sjc to the corresponding closest fip; if the closest is farther than Xg plus the sum of distances to the local 
cluster mean, then we start a new global cluster whose mean is the local mean. Finally, we recompute all 
means fip by computing the mean of all points (over all data sets) associated to each fip. See Algorithm |2] 
for the full specification of the procedure; the algorithm is derived directly as an asymptotic hard clustering 
algorithm based on the Gibbs sampler for the HDP. 

As with the DP-means algorithm, we can determine the underlying objective function, and use it to 
determine convergence. Let k = X]jLi % be the total number of local clusters, and g be the total number of 
global clusters. Then we can show that the objective optimized is the following: 

, min Ep=i T,^,,ee^ W^^J " f^pWl + + ^99, 

i^PSp=l 

where t^P = ]t\ S=^..e^P 

This objective is pleasantly simple and intuitive: we minimize the global k-means objective function, but we 
incorporate a penalty whenever either a new local cluster or a new global cluster is created. With appropriately 
chosen Xi and Xg, the result is that we obtain sharing of cluster structure across data sets. We can prove that 
the hard Gaussian HDP algorithm monotonically minimizes this objective (the proof is similar to Theorem 
3.1). 

Theorem 4.1. Algorithm^monotonically minimizes the objective Q until local convergence. 

5 Further Extensions 

We now discuss two additional extensions of the proposed objective: a spectral relaxation for the proposed 
hard clustering method and a normalized cut algorithm that does not fix the number of clusters in the graph. 

5.1 Spectral Meets Nonparametric 

Recall that spectral clustering algorithms for k-means are based on the observation that the k-means objective 
can be relaxed to a problem where the globally optimal solution may be computed via eigenvectors. In par- 
ticular, for the k-means objective, one computes the eigenvectors corresponding to the k largest eigenvalues 
of the kernel matrix K over the data; these eigenvectors form the globally optimal "relaxed" cluster indicator 
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Algorithm 2 Hard Gaussian HDP 



Input: {xij}: input data, \e : local cluster penalty parameter, Ag! global cluster penalty parameter 
Output: Global clustering l\, ...,lg and number of clusters kj for all data sets j 

1. Initialize g = 1, kj — 1 for all j and /ii to be the global mean across all data sets. 

2. Initialize local cluster indicators Zij = 1 for all i and j, and global cluster associations vji — 1 for all j. 

3. Repeat steps 4-6 until convergence: 

4. For each point Xij : 

• Compute dijp = \\xij — fj,p\\^ foi p — 1, g. 

• For allp such that Vjc 7^ p for all c — 1, kj, set dijp = dijp + A^. 

• If 

- Set kj = fcj + 1, Zij = kj, g = g + 1, fig = Xij, and Vjk^ = 5. 

• Else letp = argmiUpdijp. 

- If Vjc ~ p for some c, set 2^ = c and Vjc = p. 

- Else, set kj — kj + 1, Zij = kj, and Ujfe^ = p. 

5. For all local clusters: 

• Let Sjc = {xij\zij = c} and /ijc = jsj^ E^ies,, ^J- 

• Compute djcp = I]^gs^.^ ||a; - Mp|P forp = 1, ...,g. 

• If minp Jjcp > Aj, + I]^gs^.^ ||a; - /ijc|P, set g = 3 + 1, Vjc = 3, and fig = p,jc. 

• Else set Vjc = argmin^dicp. 

6. For each global cluster p — 1, ...,g, re-compute means: 

• Let £p = {xij\zij — c and Vjc = p}. 

• Compute /^p = E^e<?^ a:- 



matrix 1 19|. A clustering of the data is obtained by suitably post-processing the eigenvectors, e.g., clustering 
via k-means. 

In a similar manner, in this section we will show that the globally optimal solution to a relaxed DP- 
means objective function is obtained by computing the eigenvectors of the kernel matrix corresponding to all 
eigenvalues greater than A, and stacking these into a matrix. To prove the correctness of this relaxation, let 
us denote Z as the n x k cluster indicator matrix whose rows correspond to the cluster indicator variables 
Zic- Let Y — Z{Z'^ Z)^^/^ be a normalized indicator matrix, and notice that Y^Y = I. We can prove the 
following lemma. 

Lemma 5.1. The DP-means objective function can equivalently be written as vascKY tT{Y^ {K — XI)Y), 
where the optimization is performed over tlie space of all normalized indicator matrices Y. 

Proof. It was shown in lfT9l that the standard k-means function can be expressed as a trace maximization of 
tr{Y^ KY), over the space of normalized indicator matrices Y . Noting that ir{Y'^ {\I)Y) = Afc as F is an 
orthogonal n x k matrix, the lemma follows. □ 

Now we perform a standard spectral relaxation by relaxing the optimization to be over all orthonormal 
matrices Y: 

max tiiY^(K - \I)Y). (3) 

{Y I ¥-^¥=1} 

Using standard arguments, one can show the following result: 
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Theorem 5.2. By relaxing the cluster indicator matrix Y to be any orthonormal matrix, the optimal Y in the 
relaxed clustering objective Q is obtained by forming a matrix of all eigenvectors of K whose corresponding 
eigenvalues are greater than A. 

Using this result, one can design a simple spectral algorithm that computes the relaxed cluster indicator 
matrix Y , and then clusters the rows of Y, as is common for spectral clustering methods. Thus, the main 
difference between a standard spectral relaxation for k-means and the DP-means is that, for the former, we 
take the top-fc eigenvectors, while for the latter, we take all eigenvectors corresponding to eigenvalues greater 
than A. 



5.2 Graph Clustering 

It is also possible to develop extensions to the DP-means algorithm for graph cut problems such as normaUzed 
and ratio cut. 

We briefly review connections between k-means and graph clustering. We first note that it is straightfor- 
ward to apply k-means in kernel space. Suppose the data has been mapped via some function ijj so that the 
data points in feature space are -0(iCi), il]{x2), Further suppose we can compute a kernel function 
K{xi, Xj) — ip{xi)'^ip{xj) without explicitly computing the function ip. Note that the computation between 
a data point and any cluster mean can be expressed in kernel space by expanding the squared Euclidean 
distance: 

= {4'{x) - ^lef{'^p{x) - 

= il;{x)'^ il){x) - 2%l){x)'^ lie + He 



ipix) ip{x) — 
k{x, x) 



141 I4P 



When computing distances via the above method, it is unnecessary to explicitly compute the means of the 
clusters. Therefore, the resulting kernel k-means algorithm does not have an explicit mean re-estimation step; 
instead, the distances to each implicit cluster mean are computed in kernel space and then each point is re- 
assigned to the cluster corresponding to the nearest implicit cluster mean. This is repeated until convergence. 
Note that the DP-means algorithm and the hard Gaussian HDP can also easily be applied in kernel space. 

Another extension of the k-means objective is to introduce a weight Wi for each point, and to minimize a 
weighted form of the k-means objective function: 

min Ec=iExef,^i'J^»-/^c||i 
where fic = 



We can now make a connection between the k-means objective function and graph clustering ll3l [141 [TSl . 
Given a graph G = (V, £, A), where V is a set of vertices, £ a set of edges, and A is the underlying adjacency 
matrix for the graph, various methods have been proposed to cluster the vertices in the graph into a disjoint 
collection of clusters. Two popular criteria for the graph clustering problem are the ratio cut and normaUzed 
cut. In the ratio cut, one seeks a clustering Vi, Vfc of the vertices to minimize the following objective: 

. A links(Vc, V \ Vc) . ^ , 

min > V , ^ \ (Ratio Cut) 
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Figure 2: Synthetic results demonstrating advantages of our method, a) A simple data set of 3 Gaussians. b) 
NMI scores over the first 5000 Gibbs iterations — in contrast, across 100 runs, DP-means always converges 
within 8 iterations on this data, always returns 3 clusters, and yields an average NMI of .89. c) Number 
of clusters in DP-means as a function of lambda, d) One of the 50 data sets for the hard Gaussian HDP 
experiment. See text for details. 



where links (;B,C) = X^ieB ieC - Thus, the ratio cut criterion attempts to find clusters of vertices such that 
the "cut" from clusters to remaining nodes in the graph (normalized by the size of the clusters), is minimized. 
The related normalized cut problem minimizes the following: 

. A links(Vc, V \ Vc) ,. . 

min > — (Normalized Cut) 

Vi,...,v,^ deg(Vc) 

where deg(S) = links(S, V) (or equivalently, the sum of the degrees of the vertices in B). 
We state a result proven in [5] for standard k-way normalized cut. 

Theorem 5.3. Let J{K, W) be the weighted kernel k-means objective with kernel matrix K and (diagonal) 
weight matrix W, and let Cut{A) be the k-way normalized cut objective with adjacency matrix A. Let D be 
the diagonal degree matrix corresponding to A(D ^ diag{ Ae)). Then the following relationship holds: 

J{K, W)^cm + tr{D-^/'^AD-^/^) - (cr + l)fc + Cut{A), 

when we define K = aD^^ + D^^ AD^^ , W — D, and a is large enough that K is positive semi-definite. 
A similar result holds for ratio cut. 

Let the DP-means objective — easily extended to kernel space and to use weights — be given by J{K, W) + 
Xk, and let the analogous penalized normalized cut objective be given by Cut(A) + X'k. Letting an + 
tr(D"i/2ylL>"i/2) = C, a constant, we have that J{K, W) + Xk = 

C + Cut{A) - (o- + l)fc + Xk = C + Cut{A) + X'k, 

where A' — A — a— 1. Thus, optimizing the hard DP weighted kernel k-means objective with model parameter 
A is equivalent to optimizing the penalized normalized cut objective with model parameter A' = A — a — 1, 
and with the construction of K and W as in the above theorem. Utilizing the results of f5|, one can show 
that the distance between a node and a cluster mean can be performed in 0(|£'|) time. A straightforward 
extension of Algorithm[T]can then be adapted for the above penalized normalized cut objective. 



6 Experiments 

We conclude with a brief set of experiments to demonstrate the utility of our approach. The goal is to 
demonstrate that hard clustering via Bayesian nonparametrics enjoys many properties of Bayesian techniques 
(unlike k-means) but features the speed and scalability of k-means. 
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Data set 


DP-means 


k-means 


Gibbs 


Wine 


.41 


.43 


.42 


Iris 


.75 


.76 


.73 


Pima 


.02 


.03 


.02 


Soybean 


.72 


.66 


.73 


Car 


.07 


.05 


.15 


Balance Scale 


.17 


.11 


.14 


Breast Cancer 


.04 


.03 


.04 


Vehicle 


.18 


.18 


.17 



Table 1: Average NMI scores on a set of UCI data sets. Note that Gibbs sampling is handicapped since we 
utilize a vahdation set for parameter tuning. 

6.1 Setup 

Throughout the experiments, we utilize normalized mutual information (NMI) between ground truth clusters 
and algorithm outputs for evaluation, as is common for clustering applications (it also allows us to compare 
results when the number of outputted clusters does not match the number of clusters in the ground truth). 
Regarding parameter selection, there are various potential ways of choosing A; for clarity in making compar- 
isons to k-means we fix k (and g) and then find a suitable A. In particular, we found that a simple farthest-first 
heuristic is effective, and we utilize this approach in all experiments. Given an (approximate) number of 
desired clusters k, we first initialize a set T with the global mean. We iteratively add to T by finding the point 
in the data set which has the maximum distance to T (the distance to T is the smallest distance among points 
in T). We repeat this k times and return the value of the maximum distance to T in round k as A. We utilize 
a similar procedure for the hard HDP, except that for A^ we average the values of the above procedure over 
all data sets, and for Xg we replace distances of points to elements of T with sums of distances of points in 
a data set to elements of T. For Gibbs sampling, we consider the model where the covariances are fixed to 
crl, there is a zero-mean pi Gaussian prior on the means, and an inverse-Gamma prior on a. (For the bench- 
mark data, we considered selection of cr based on cross-validation, as it yielded better results, though this is 
against the Bayesian spirit.) We set p — 100 throughout our experiments. We also consider two strategies 
for determining a: one where we place a gamma prior on a, as is standard for DP mixtures [6i, and another 
where we choose a via a validation set. 

6.2 DP-means Results 

We begin with a simple illustration of some of the key properties of our approach on a synthetic data set of 
three Gaussians, shown in Figure[2^. When we run DP-means on this data, the algorithm terminates within 8 
iterations with an average NMI score of .89 (based on 100 runs). In contrast, Figure|2]3 shows the NMI scores 
of the clusterings produced by two Gibbs runs (no burn-in) over the first 5000 iterations, one that learns a via 
a gamma prior, and another that uses a validation set to tune a. The learning approach does well around 1500 
iterations, but eventually more than three clusters are produced, leading to poor results on this data set. The 
validation approach yields three clusters, but it takes approximately 3000 iterations before Gibbs sampling is 
able to converge to three clusters (in contrast, it typically requires only three iterations before DP-means has 
reached an NMI score above .8). Additionally, we plot the number of clusters produced by DP-means as a 
function of A in Figure [2];; here we see that there is a large interval of A values where the algorithm returns 
three clusters. Note that all methods are initialized with all points in a single cluster; we fully expect that 
better initialization would benefit these algorithms. 

Next we consider a benchmarking comparison among k-means, DP-means, and Gibbs sampling to demon- 
strate comparable accuracies among the methods. We selected 8 common UCI data sets, and used class labels 
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as the ground-truth for clusters. Each data set was randomly split 30/70 for validation/clustering (we stress 
that validation is used only for Gibbs sampling). On the validation set, we validated both a and a, which 
yielded the best results. We ran the Gibbs sampler for 1000 burn-in iterations, and then ran for 5000 iterations, 
selecting every 10 samples. The NMl is computed between the ground-truth and the computed clusters, and 
results are averaged over 10 runs. The results are shown in Table[T] We see that, as expected, the results are 
comparable among the three algorithms: DP -means achieves higher NMl on 5 of 8 data sets in comparison 
to k-means, and 4 of 8 in comparison to Gibbs sampling. 

To demonstrate scalability, we additionally ran our algorithms over the 312,320 image patches of the 
Photo Tourism data set 1 15 1, a common vision data set. Each patch is 128-dimensional. Per iteration, the DP- 
means algorithm and the Gibbs sampler require similar computational time (37.9 seconds versus 29.4 seconds 
per iteration). However, DP-means converges fully in 63 iterations, whereas obtaining full convergence of 
the Gibbs sampler is infeasible on this data set. 

6.3 Hard Gaussian HDP Results 

As with DP-means, we demonstrate results on synthetic data to highlight the advantages of our approach as 
compared to the baselines. We generate parameters for 15 ground-truth Gaussian distributions (means are 
chosen uniformly in [0, 1]^ and covariances are .01 • /). Then we generate 50 data sets as follows: for each 
data set, we choose 5 of the 15 Gaussians at random, and then generate 25 total points from these chosen 
Gaussians (5 points per Gaussian). An example of one of the 50 data sets is shown in Figure |2}l; in many 
cases, it is difficult to cluster the data sets individually, as shown in the figure. 

Our goal is to find shared clusters in this data. To evaluate the quality of results, we compute the NMl 
between the ground-truth and the outputted clusters, for each data set, and average the NMl scores across the 
data sets. As a baseline, we run k-means and DP-means on the whole data set all at once (i.e., we treat all 
twenty data sets as one large data set) as well as k-means and DP-means on the individual data sets, k-means 
on the whole data set obtains an average NMl score of .77 while DP-means yields .73. When we run the 
hard Gaussian HDP, we obtain 17 global clusters, and each data set forms on average 4.4 local clusters per 
data set. The average NMl for this method is .81, significantly higher than the non-hierarchical approaches. 
When we run k-means or DP-means individually on each data set and compute the average NMl, we obtain 
scores of .79 for both; note that there is no automatic cluster sharing via this approach. The hard Gaussian 
HDP takes 28.8s on this data set, versus 2.7s for k-means on the full data. 

7 Conclusions and Open Problems 

This paper outlines connections arising between DP mixture models and hard clustering algorithms, and de- 
velops scalable algorithms for hard clustering that retain some of the benefits of Bayesian nonparametric and 
hierarchical modeling. Our analysis is only a first step, and we note that there are several avenues of future 
work, including i) improvements to the basic algorithms using ideas from k-means, such as local search ||4l, 
ii) spectral or semidefinite relaxations for the hard Gaussian HDP, iii) extensions to other Bayesian nonpara- 
metric processes such as the Pitman-Yor process 19] [ij, iv) generalizations to exponential family mixture 
models HI, and v) additional comparisons to sampling-based and variational inference methods. 

Acknowledgements. We thank Trevor Darrell and the anonymous reviewers for the corresponding ICML 
paper for helpful suggestions. 
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